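In recent years, as digital applications and services have been widely adopted across industries, the underlying IT systems have kept growing in capacity and complexity. System failures have become inevitable, leading to degraded service performance or even service outages, and they pose serious risks to system reliability. This talk reviews our experience in building a reliability-driven framework for intelligent operations and maintenance. Working from raw operations data such as software and hardware logs, metrics, system topology, alerts, and tickets, we propose data-driven intelligent solutions for anomaly detection, fault diagnosis, root cause localization, and failure prediction, with the ultimate goal of improving overall system reliability.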
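As industries of every kind undergo digital transformation and upgrading, new-generation computing and networking technologies such as cloud computing, the Internet, and 5G are becoming deeply intertwined with the economy and society. Computing power and networks have become core factors shaping industrial development. Ubiquitously connected networks and pervasive computing are accelerating the convergence of cloud computing, networking, and intelligent computing, and cloud-network intelligent collaboration is increasingly a focus of both academia and industry. China attaches great importance to this trend and has launched national network-computing convergence projects, exemplified by "East Data, West Computing" (东数西算). In this era, how to optimize the deployment of computing and network resources and achieve coordination among cloud data centers, edge data centers, intelligent computing platforms, and networks has become a new challenge. This forum will invite academicians from China and abroad, scholars from well-known universities, and leading experts from companies such as Huawei/Alibaba and China Mobile/China Unicom for an in-depth discussion of cloud-network intelligent collaboration technologies and the future of the Internet.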
In recent years, online service systems have become increasingly popular. Incidents in these systems can cause significant economic loss and customer dissatisfaction. Incident triage, the process of assigning a new incident to the responsible team, is vitally important for quick recovery of the affected service. Our industry experience shows that, in practice, incident triage is not conducted only once at the beginning but is a continuous process, in which engineers from different teams discuss an incident intensively among themselves and continuously refine the incident-triage result until the correct assignment is reached. In particular, our empirical study on 8 real online service systems shows that the percentage of incidents that were reassigned ranges from 5.43% to 68.26%, and the number of discussion items before achieving the correct assignment is up to 11.32 on average. To improve the existing incident triage process, in this paper we propose DeepCT, a Deep learning based approach to automated Continuous incident Triage. DeepCT incorporates a novel GRU-based (Gated Recurrent Unit) model with an attention-based mask strategy and a revised loss function, which can incrementally learn knowledge from discussions and update incident-triage results. Using DeepCT, the correct incident assignment can be achieved with fewer discussions. We conducted an extensive evaluation of DeepCT on 14 large-scale online service systems at Microsoft. The results show that DeepCT achieves more accurate and efficient incident triage; e.g., the average accuracy of identifying the responsible team precisely ranges from 0.641 to 0.729 as the number of discussion items increases from 1 to 5. DeepCT also outperforms the state-of-the-art bug triage approach with statistical significance.
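Many aspects of modern production and daily life depend on large, complex software and hardware systems, including the Internet, high-performance computing, telecommunications, finance, power grids, the Internet of Things, medical networks and devices, aerospace, and military equipment and networks. Users of these systems all expect a good experience. The deployment, operation, and maintenance of such complex systems therefore require professional operations staff who can respond to unexpected incidents and keep the systems running safely and reliably. Because such incidents generate massive amounts of data, intelligent operations can essentially be regarded as a specific application scenario of big data analytics.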
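This talk introduces Paratera's (并行科技) 7×24 online operations and maintenance service system for data centers. The system transmits real-time operating data from hundreds of thousands of servers distributed across regions to a centralized operations monitoring center. Using automated analysis software and a self-learning expert knowledge base, it automatically identifies known software and hardware faults and potential risks, and professional IT service staff proactively and directly repair faults in remote data centers. It also fully automates big-data analysis of massive application runtime characteristics and directly produces optimization recommendations for data center system selection, greatly improving the efficiency of data center operation and management.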
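Chair, CCF Technical Committee on Internet
National University of Defense Technology
Executive Committee Member, CCF Technical Committee on Internet
Nankai University
Secretary-General, CCF Technical Committee on Internet
National University of Defense Technology
Deputy Secretary-General, CCF Technical Committee on Internet
Guilin University of Electronic Technology
Executive Committee Member, CCF Technical Committee on Internet
Northeastern University (China)
Executive Committee Member, CCF Technical Committee on Internet
Nankai University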